Add vector exporter for semantic search embeddings by marcarl · Pull Request #32 · se-lex/sfs-processor

marcarl · 2026-01-05T18:00:14Z

Adds a new 'vector' output format that converts SFS documents to vector
embeddings suitable for semantic search and retrieval. Key features:

- Applies temporal filtering (like md/html mode) to include only current regulations
- Intelligent text chunking (by paragraph, chapter, section, or semantic boundaries)
- OpenAI text-embedding-3-large model (best quality, 3072 dimensions)
- Multiple backend support: PostgreSQL/pgvector, Elasticsearch, JSON file
- Integrated into sfs_processor.py with CLI options
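
For readers unfamiliar with paragraph-level chunking, here is a minimal sketch of the idea. It is not the code in exporters/vector/chunking.py: the `Chunk` class, the `chunk_by_paragraph` function, the blank-line paragraph delimiter, and the `max_chars` limit are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical chunk record (not the PR's DocumentChunk)."""
    index: int
    text: str


def chunk_by_paragraph(text: str, max_chars: int = 1000) -> List[Chunk]:
    """Split on blank lines, then greedily pack paragraphs into chunks.

    A chunk is closed once adding the next paragraph would exceed
    max_chars. A single paragraph longer than max_chars is kept whole.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[Chunk] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Close the current chunk and start a new one with this paragraph.
            chunks.append(Chunk(len(chunks), current))
            current = para
        else:
            current = candidate
    if current:
        chunks.append(Chunk(len(chunks), current))
    return chunks
```

Packing small paragraphs together keeps the number of embedding requests down, while closing a chunk at a paragraph boundary avoids cutting a provision mid-sentence.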

New files:

- exporters/vector/__init__.py - Module entry point
- exporters/vector/vector_export.py - Main export functionality
- exporters/vector/chunking.py - Document chunking strategies
- exporters/vector/embeddings.py - Embedding provider interface
- exporters/vector/backends/ - Vector store implementations

Usage: python sfs_processor.py --formats vector --vector-backend postgresql


Add documentation for the new vector export format including:

- Overview of vector format in output formats section
- Temporal processing behavior for vector format
- CLI parameters for vector-specific options
- Dedicated section explaining semantic search use cases
- Backend comparison table (JSON, PostgreSQL, Elasticsearch)
- Usage examples with mock and production embeddings

JSON backend now saves vectors to the output directory instead of the repository root. Sets backend_config["file_path"] when backend_type is "json".

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
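
A rough sketch of what this change amounts to, under stated assumptions: the function name `resolve_backend_config`, the `output_dir` parameter, and the file name `vectors.json` are hypothetical, and whether the real code overrides a caller-supplied `file_path` is not stated in the commit. Only `backend_config["file_path"]` and `backend_type` come from the commit message.

```python
from pathlib import Path
from typing import Optional


def resolve_backend_config(
    backend_type: str,
    output_dir: str,
    backend_config: Optional[dict] = None,
) -> dict:
    """Return a copy of backend_config with a JSON file path under output_dir."""
    config = dict(backend_config or {})  # copy so the caller's dict is not mutated
    if backend_type == "json":
        # Assumption: keep an explicit file_path if the caller provided one.
        # "vectors.json" is a placeholder file name.
        config.setdefault("file_path", str(Path(output_dir) / "vectors.json"))
    return config
```

Anchoring the path to the output directory means the result no longer depends on the working directory the CLI happens to be started from.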

Standardize all metadata field names to English across vector export:

- ikraft_datum → effective_date (when regulation takes effect)
- utfardad_datum → issued_date (when regulation was issued)
- upphor_datum → expiration_date (when regulation expires)
- upphavd → repealed (if regulation is repealed)

Changes:

- Updated VectorRecord and DocumentChunk with English field names
- Modified PostgreSQL schema with English column names
- Updated Elasticsearch index mappings
- Added metadata normalization from Swedish to English
- Enhanced metadata extraction from both frontmatter and selex attributes
- All backends (JSON, PostgreSQL, Elasticsearch) now use consistent English fields

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
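
The renaming listed in this commit can be pictured as a simple key mapping. The sketch below uses the four field pairs from the commit message; the names `SWEDISH_TO_ENGLISH` and `normalize_metadata` are hypothetical and the actual normalization in the PR may be structured differently.

```python
# Field pairs taken from the commit message; the constant name is illustrative.
SWEDISH_TO_ENGLISH = {
    "ikraft_datum": "effective_date",
    "utfardad_datum": "issued_date",
    "upphor_datum": "expiration_date",
    "upphavd": "repealed",
}


def normalize_metadata(metadata: dict) -> dict:
    """Return a new dict with Swedish keys renamed to their English equivalents.

    Keys without a mapping pass through unchanged. If both the Swedish and
    the English form of a field are present, the one that appears later in
    the input wins.
    """
    return {SWEDISH_TO_ENGLISH.get(key, key): value for key, value in metadata.items()}
```

Keeping the mapping in one table means all three backends can share it, so the JSON, PostgreSQL, and Elasticsearch outputs cannot drift apart on field names.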

Change date fields from TEXT to DATE type for proper date handling:

- effective_date: TEXT → DATE
- issued_date: TEXT → DATE
- expiration_date: TEXT → DATE

Elasticsearch already uses correct date type with format "yyyy-MM-dd||strict_date". This enables proper date queries, sorting, and range filtering in PostgreSQL.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
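
With DATE columns, string metadata has to be turned into real date values before it is written. A minimal sketch of such a coercion step follows; the helper name `to_date` and the decision to map empty values to `None` (SQL NULL) are assumptions, not taken from the PR.

```python
from datetime import date, datetime
from typing import Optional, Union


def to_date(value: Union[str, date, datetime, None]) -> Optional[date]:
    """Coerce an ISO-style value to datetime.date, or None if it is empty.

    Raises ValueError for strings that do not start with a valid yyyy-MM-dd date.
    """
    if value is None or value == "":
        return None
    if isinstance(value, datetime):  # check datetime first: it subclasses date
        return value.date()
    if isinstance(value, date):
        return value
    # Keep only the yyyy-MM-dd prefix so full timestamps are accepted too.
    return date.fromisoformat(str(value)[:10])
```

Passing `datetime.date` objects rather than strings lets a driver such as psycopg bind them as DATE parameters, so range filters like `effective_date <= CURRENT_DATE` compare dates instead of text.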

claude and others added 5 commits January 5, 2026 17:25

marcarl merged commit 8282dcf into main Jan 6, 2026
5 checks passed

marcarl deleted the claude/vector-exporter-regulations-LR4IP branch January 6, 2026 10:17

2 participants
